{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用HMM进行词性标注\n",
    "这里我们用NLTK自带的Brown词库进行学习。\n",
    "\n",
    "算了，稍微说几句吧。\n",
    "\n",
    "假设我们的单词集： words = w1 ... wN\n",
    "\n",
    "Tag集： tags = t1 ... tN\n",
    "\n",
    "P(tags | words) 正比于 P(ti | t{i-1}) * P(wi | ti)\n",
    "\n",
    "为了找一个句子的tag，\n",
    "\n",
    "我们其实就是找的最好的一套tags，让他最能够符合给定的单词(words)。\n",
    "\n",
    "首先，\n",
    "\n",
    "## 导入需要的库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package brown to\n",
      "[nltk_data]     C:\\Users\\mantch\\AppData\\Roaming\\nltk_data...\n",
      "[nltk_data]   Unzipping corpora\\brown.zip.\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "import sys\n",
    "nltk.download('brown')\n",
    "from nltk.corpus import brown"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 预处理词库\n",
    "这里需要做的预处理是：给词们加上开始和结束符号。\n",
    "\n",
    "Brown里面的句子都是自己标注好了的，长这个样子：(I , NOUN), (LOVE, VERB), (YOU, NOUN)\n",
    "\n",
    "那么，我们的开始符号也得跟他的格式符合，\n",
    "\n",
    "我们用：\n",
    "\n",
    "(START, START) (END, END)\n",
    "\n",
    "来代表"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "brown_tags_words = []\n",
    "for sent in brown.tagged_sents():\n",
    "    # 先加开头\n",
    "    brown_tags_words.append((\"START\", \"START\"))\n",
    "    # 为了省事儿，我们把tag都省略成前两个字母\n",
    "    brown_tags_words.extend([(tag[:2], word) for (word, tag) in sent])\n",
    "    # 加个结尾\n",
    "    brown_tags_words.append((\"END\", \"END\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 词统计\n",
    "这个时候，我们要把我们所有的词库中拥有的单词与tag之间的关系，做个简单粗暴的统计。\n",
    "\n",
    "也就是我们之前说过的：\n",
    "\n",
    "P(wi | ti) = count(wi, ti) / count(ti)\n",
    "\n",
    "你可以自己一个个的loop全部的corpus，\n",
    "\n",
    "当然，这里NLTK给了我们做统计的工具，（这属于没有什么必要的hack，装起逼来也不X，所以，大家想自己实现，可以去实现，不想的话，就用这里提供的方法）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# conditional frequency distribution\n",
    "cfd_tagwords = nltk.ConditionalFreqDist(brown_tags_words)\n",
    "# conditional probability distribution\n",
    "cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords, nltk.MLEProbDist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好，现在我们看看平面统计下来的结果："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The probability of an adjective (JJ) being 'new' is 0.01472344917632025\n",
      "The probability of a verb (VB) being 'duck' is 6.042713350943527e-05\n"
     ]
    }
   ],
   "source": [
    "print(\"The probability of an adjective (JJ) being 'new' is\", cpd_tagwords[\"JJ\"].prob(\"new\"))\n",
    "print(\"The probability of a verb (VB) being 'duck' is\", cpd_tagwords[\"VB\"].prob(\"duck\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好，接下来，还有第二个公式需要计算：\n",
    "\n",
    "P(ti | t{i-1}) = count(t{i-1}, ti) / count(t{i-1})\n",
    "\n",
    "这个公式跟words没有什么卵关系。它是属于隐层的马尔科夫链。\n",
    "\n",
    "所以 我们先取出所有的tag来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "brown_tags = [tag for (tag, word) in brown_tags_words ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# count(t{i-1} ti)\n",
    "# bigram的意思是 前后两个一组，联在一起\n",
    "cfd_tags= nltk.ConditionalFreqDist(nltk.bigrams(brown_tags))\n",
    "# P(ti | t{i-1})\n",
    "cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好的，可以看看效果了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "If we have just seen 'DT', the probability of 'NN' is 0.5057722522030194\n",
      "If we have just seen 'VB', the probability of 'JJ' is 0.016885067592065053\n",
      "If we have just seen 'VB', the probability of 'NN' is 0.10970977711020183\n"
     ]
    }
   ],
   "source": [
    "print(\"If we have just seen 'DT', the probability of 'NN' is\", cpd_tags[\"DT\"].prob(\"NN\"))\n",
    "print( \"If we have just seen 'VB', the probability of 'JJ' is\", cpd_tags[\"VB\"].prob(\"DT\"))\n",
    "print( \"If we have just seen 'VB', the probability of 'NN' is\", cpd_tags[\"VB\"].prob(\"NN\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一些有趣的结果：\n",
    "那么，比如， 一句话，\"I want to race\"， 一套tag，\"PP VB TO VB\"\n",
    "\n",
    "他们之间的匹配度有多高呢？\n",
    "\n",
    "其实就是："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "P(START) * P(PP|START) * P(I | PP) *\n",
    "            P(VB | PP) * P(want | VB) *\n",
    "            P(TO | VB) * P(to | TO) *\n",
    "            P(VB | TO) * P(race | VB) *\n",
    "            P(END | VB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The probability of the tag sequence 'START PP VB TO VB END' for 'I want to race' is: 1.0817766461150474e-14\n"
     ]
    }
   ],
   "source": [
    "prob_tagsequence = cpd_tags[\"START\"].prob(\"PP\") * cpd_tagwords[\"PP\"].prob(\"I\") * \\\n",
    "    cpd_tags[\"PP\"].prob(\"VB\") * cpd_tagwords[\"VB\"].prob(\"want\") * \\\n",
    "    cpd_tags[\"VB\"].prob(\"TO\") * cpd_tagwords[\"TO\"].prob(\"to\") * \\\n",
    "    cpd_tags[\"TO\"].prob(\"VB\") * cpd_tagwords[\"VB\"].prob(\"race\") * \\\n",
    "    cpd_tags[\"VB\"].prob(\"END\")\n",
    "\n",
    "print( \"The probability of the tag sequence 'START PP VB TO VB END' for 'I want to race' is:\", prob_tagsequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Viterbi 的实现\n",
    "如果我们手上有一句话，怎么知道最符合的tag是哪组呢？\n",
    "\n",
    "首先，我们拿出所有独特的tags（也就是tags的全集）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "distinct_tags = set(brown_tags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后 随手找句话"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = [\"I\", \"want\", \"to\", \"race\" ]\n",
    "sentlen = len(sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，开始维特比：\n",
    "\n",
    "从1循环到句子的总长N，记为i\n",
    "\n",
    "每次都找出以tag X为最终节点，长度为i的tag链"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "viterbi = [ ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同时，还需要一个回溯器：\n",
    "\n",
    "从1循环到句子的总长N，记为i\n",
    "\n",
    "把所有tag X 前一个Tag记下来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "backpointer = [ ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'WD': 0.0, 'AP': 0.0, ',-': 0.0, 'RP': 0.0, 'EX': 0.0, 'AT': 0.0, 'MD': 0.0, '*': 0.0, 'NP': 1.7319067623793952e-06, ')': 0.0, '.-': 0.0, 'TO': 0.0, ':': 0.0, '--': 0.0, 'PP': 0.014930900689060006, '(-': 0.0, 'WQ': 0.0, 'NN': 1.0580313619573935e-06, 'CD': 0.0, 'FW': 0.0, 'QL': 0.0, '(': 0.0, '.': 0.0, '``': 0.0, ',': 0.0, \"''\": 0.0, 'IN': 0.0, 'RB': 0.0, 'OD': 0.0, 'CC': 0.0, '*-': 0.0, 'RN': 0.0, 'VB': 0.0, 'WR': 0.0, 'DT': 0.0, 'PN': 0.0, 'DO': 0.0, 'BE': 0.0, \"'\": 0.0, 'END': 0.0, 'CS': 0.0, ':-': 0.0, 'HV': 0.0, 'WP': 0.0, ')-': 0.0, 'NR': 0.0, 'NI': 3.3324520848931064e-07, 'JJ': 0.0, 'AB': 0.0, 'UH': 0.0}\n",
      "{'WD': 'START', 'AP': 'START', ',-': 'START', 'RP': 'START', 'EX': 'START', 'AT': 'START', 'MD': 'START', '*': 'START', 'NP': 'START', ')': 'START', '.-': 'START', 'TO': 'START', ':': 'START', '--': 'START', 'PP': 'START', '(-': 'START', 'WQ': 'START', 'NN': 'START', 'CD': 'START', 'FW': 'START', 'QL': 'START', '(': 'START', '.': 'START', '``': 'START', ',': 'START', \"''\": 'START', 'IN': 'START', 'RB': 'START', 'OD': 'START', 'CC': 'START', '*-': 'START', 'RN': 'START', 'VB': 'START', 'WR': 'START', 'DT': 'START', 'PN': 'START', 'DO': 'START', 'BE': 'START', \"'\": 'START', 'END': 'START', 'CS': 'START', ':-': 'START', 'HV': 'START', 'WP': 'START', ')-': 'START', 'NR': 'START', 'NI': 'START', 'JJ': 'START', 'AB': 'START', 'UH': 'START'}\n"
     ]
    }
   ],
   "source": [
    "first_viterbi = { }\n",
    "first_backpointer = { }\n",
    "for tag in distinct_tags:\n",
    "    # don't record anything for the START tag\n",
    "    if tag == \"START\": continue\n",
    "    first_viterbi[ tag ] = cpd_tags[\"START\"].prob(tag) * cpd_tagwords[tag].prob( sentence[0] )\n",
    "    first_backpointer[ tag ] = \"START\"\n",
    "\n",
    "print(first_viterbi)\n",
    "print(first_backpointer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以上，是所有的第一个viterbi 和第一个回溯点。\n",
    "\n",
    "接下来，把楼上这些，存到Vitterbi和Backpointer两个变量里去"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "viterbi.append(first_viterbi)\n",
    "backpointer.append(first_backpointer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以先看一眼，目前最好的tag是啥："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Word 'I' current best two-tag sequence: START PP\n"
     ]
    }
   ],
   "source": [
    "currbest = max(first_viterbi.keys(), key = lambda tag: first_viterbi[ tag ])\n",
    "print( \"Word\", \"'\" + sentence[0] + \"'\", \"current best two-tag sequence:\", first_backpointer[ currbest], currbest)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好的\n",
    "\n",
    "一些都清晰了\n",
    "\n",
    "我们开始loop："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Word 'want' current best two-tag sequence: PP VB\n",
      "Word 'to' current best two-tag sequence: VB TO\n",
      "Word 'race' current best two-tag sequence: IN NN\n"
     ]
    }
   ],
   "source": [
    "for wordindex in range(1, len(sentence)):\n",
    "    this_viterbi = { }\n",
    "    this_backpointer = { }\n",
    "    prev_viterbi = viterbi[-1]\n",
    "    \n",
    "    for tag in distinct_tags:\n",
    "        # START没有卵用的，我们要忽略\n",
    "        if tag == \"START\": continue\n",
    "        \n",
    "        # 如果现在这个tag是X，现在的单词是w，\n",
    "        # 我们想找前一个tag Y，并且让最好的tag sequence以Y X结尾。\n",
    "        # 也就是说\n",
    "        # Y要能最大化：\n",
    "        # prev_viterbi[ Y ] * P(X | Y) * P( w | X)\n",
    "        \n",
    "        best_previous = max(prev_viterbi.keys(),\n",
    "                            key = lambda prevtag: \\\n",
    "            prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex]))\n",
    "\n",
    "        this_viterbi[ tag ] = prev_viterbi[ best_previous] * \\\n",
    "            cpd_tags[ best_previous ].prob(tag) * cpd_tagwords[ tag].prob(sentence[wordindex])\n",
    "        this_backpointer[ tag ] = best_previous\n",
    "    \n",
    "    # 每次找完Y 我们把目前最好的 存一下\n",
    "    currbest = max(this_viterbi.keys(), key = lambda tag: this_viterbi[ tag ])\n",
    "    print( \"Word\", \"'\" + sentence[ wordindex] + \"'\", \"current best two-tag sequence:\", this_backpointer[ currbest], currbest)\n",
    "\n",
    "\n",
    "    # 完结\n",
    "    # 全部存下来\n",
    "    viterbi.append(this_viterbi)\n",
    "    backpointer.append(this_backpointer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "找END，结束："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 找所有以END结尾的tag sequence\n",
    "prev_viterbi = viterbi[-1]\n",
    "best_previous = max(prev_viterbi.keys(),\n",
    "                    key = lambda prevtag: prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob(\"END\"))\n",
    "\n",
    "prob_tagsequence = prev_viterbi[ best_previous ] * cpd_tags[ best_previous].prob(\"END\")\n",
    "\n",
    "# 我们这会儿是倒着存的。。。。因为。。好的在后面\n",
    "best_tagsequence = [ \"END\", best_previous ]\n",
    "# 同理 这里也有倒过来\n",
    "backpointer.reverse()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最终：\n",
    "\n",
    "回溯所有的回溯点\n",
    "\n",
    "此时，最好的tag就是backpointer里面的current best"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "current_best_tag = best_previous\n",
    "for bp in backpointer:\n",
    "    best_tagsequence.append(bp[current_best_tag])\n",
    "    current_best_tag = bp[current_best_tag]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "显示结果："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The sentence was: I want to race \n",
      "\n",
      "The best tag sequence is: START PP VB IN NN END \n",
      "\n",
      "The probability of the best tag sequence is: 5.71772824864617e-14\n"
     ]
    }
   ],
   "source": [
    "best_tagsequence.reverse()\n",
    "print( \"The sentence was:\", end = \" \")\n",
    "for w in sentence: print( w, end = \" \")\n",
    "print(\"\\n\")\n",
    "print( \"The best tag sequence is:\", end = \" \")\n",
    "for t in best_tagsequence: print (t, end = \" \")\n",
    "print(\"\\n\")\n",
    "print( \"The probability of the best tag sequence is:\", prob_tagsequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果不是很好，说明要加更多的语料"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
